An Approach to Making SPAI and PSAI Preconditioning 
Effective for Large Irregular Sparse Linear Systems* 

Zhongxiao JW Qian Zhang* 



Abstract 

We investigate the SPAI and PSAI preconditioning procedures and shed light on two 
important features of them: (i) For the large linear system Ax = b with A irregular sparse, 
i.e., with A having s relatively dense columns, SPAI may be very costly to implement and 
the resulting sparse approximate inverses may be ineffective for preconditioning. PSAI 
can be effective for preconditioning but may require excessive storage and be unacceptably 
time consuming; (ii) the situation is improved drastically when A is regular sparse, that 
is, all of its columns are sparse. In this case, both SPAI and PSAI are efficient and more 
likely to construct effective preconditioners. Motivated by these features, we propose an 
approach to making SPAI and PSAI more practical for Ax = b with A irregular sparse. 
We first split A into a regular sparse A and a matrix of low rank s. Then exploiting the 
Sherman-Morrison- Woodbury formula, we transform Ax = b into s+1 new linear systems 
with the same coefficient matrix A, use SPAI and PSAI to compute sparse approximate 
inverses of A efficiently and apply Krylov iterative methods to solve the preconditioned 
linear systems. We show how to recover an approximate solution of Ax = b from those of 
the s+1 new systems. Based on the relationship between approximate solutions, we design 
reliable stopping criteria for the s + 1 systems to guarantee that the approximate solution 
of the original Ax = b satisfies a desired accuracy. Given the fact that irregular sparse 
linear systems are common in applications, this approach widely extends the practicability 
of SPAI and PSAI. Numerical results demonstrate the considerable superiority of our 
approach to the direct application of SPAI and PSAI to Ax = b. 
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1 Introduction 

Krylov iterative solvers [9j [T7] have been very popular for the large sparse linear system 

Ax = b, (1) 

where A is a nonsingular n x n matrix and b is an n-dimensional vector. However, when 
A has bad spectral property or is ill conditioned, the solvers generally exhibit extremely 
slow convergence and necessitate preconditioning techniques. Sparse approximate inverse 
(SAI) preconditioning is for general purpose and aims to compute a preconditioner M ~ A" 1 
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directly so as to improve the conditioning of ([I]) for the vast majority of problems [31 117] . 
One typical SAI preconditioning technique is the adaptive sparse approximate inverse (SPAI) 
procedure [11], which has been widely used and proven quite effective for a broad range of 
matrices. The recently proposed adaptive power sparse approximate inverse (PSAI) proce- 
dure with dropping [15], called PSAI(toZ), is also an effective SAI preconditioning technique 
and has been shown to be at least competitive with SPAI numerically and can outperform 
SPAI for some practical problems. 

The SPAI and PSAI(ioZ) procedures fall into the category of Frobenius norm (hereafter 
abbreviated as F-norm) minimization based SAI preconditioning. During loops they solve a 
sequence of constrained optimization problems of the form 

min \\AM-I\\ F , (2) 

where M is the set of matrices with a given sparsity pattern S. Denote by M k the set of 
n-dimensional vectors whose sparsity pattern is S k = {i\(i,k) G S}. Then ([2]) is decoupled 
into n independent constrained least squares (LS) problems 

min ||.Amfc — efc||, k = 1, 2, . . . , n, (3) 

m k eM k 

with e k the k-th column of the n x n identity matrix /. Here and hereafter, the norm || • || 
denotes the vector 2-norm or the matrix spectral norm. For each k, let fh k = m k (Sk), £k be 
the set of indices of nonzero rows of A(:,Sk), Af, = A{C k ,Sk) and e k = efc(£fc). Then t$S§ is 
reduced to the smaller unconstrained LS problems 

min \\A k m k - e k \\, k = l,2,...,n, (4) 

which can be solved by the QR decomposition in parallel. SPAI and PSAI(toZ) determine 
the sparsity pattern S k dynamically: starting with a simple initial pattern, which is usually 
taken as the pattern of e k , S k is augmented or adjusted adaptively until the residual norm 
|| Am k — e k \\ falls below a given tolerance or the maximum number of augmentations is reached. 
The distinction between SPAI and PSAI(toZ) lies in the way that S k is augmented or adjusted. 
The already existing positions of nonzero entries in ra k for SPAI are retained in the later 
loops and only a few new members, called most profitable indices in [11], are added per 
loop. PSAI(toZ) aims to drop the small entries whose sizes are below a certain tolerance 
and only retain the remaining large ones during the determination of M, and S k is adjusted 
dynamically by not only absorbing new members but also discarding the positions where the 
entries of m k become small during loops. Because of these essential differences, PSAI(toZ) 
may better capture effective sparsity patterns of A -1 than SPAI. 

Throughout the paper, we will frequently use two keywords "regular sparse" and "irreg- 
ular sparse" for a matrix. By regular sparse, we, qualitatively and sensibly, mean that all 
the columns of the matrix are sparse and no column has much more nonzero entries than 
the others. By irregular sparse, we mean that there are some (relatively) dense columns, 
each of which has considerably more nonzero entries than the other sparse columns. We call 
such columns irregular and denote by s the number of them. Problem (TjQ) with A irregular 
sparse are quite common and arises in semiconductor device problem, power network prob- 
lem, circuit simulation, optimization problems, computer graphics/vision problem, economic 
problems, material problems, theoretical/quantum chemistry problem and many others [TJ. 
Quantitatively, we will declare a matrix irregular sparse if its densest column has Wp entries 
or more, where p is the average number of nonzero entries each column. We will defer quan- 
titative details on these two keywords to Section [3l Under this definition, we investigate all 
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the matrices in the collection [7], which contains 2649 matrices and of them 1978 are square. 
We find that 682 out of these 1978 matrices are irregular sparse. That is, there are 34% of 
the matrices in the collection which are irregular sparse. In the collection, there are some 
social networks, citation networks, and other graphs that are not typically viewed as linear 
systems. They often have dense columns. But even if these matrices are removed, there are 
30% irregular matrices, 464 out of 1554, in the collection. So irregular sparse linear systems 
are quite common in applications. We will give more illustrative details on irregular matrices 
in Section HI where we clearly show where irregular sparse linear systems are from, how big 
s can be and how dense irregular columns 

The possible success of any SAI preconditioning procedure is based on the crucial assump- 
tion that A has good sparse approximate inverses. Under this assumption throughout the 
paper, we consider the case that A is irregular. It is empirically observed that good sparse 
approximate inverses of A are irregular too. Nevertheless, since the number of nonzero entries 
in any single column of the final M in SPAI is bounded by the number of most profitable 
indices per loop times the maximum loop steps, a column of M may have not enough or 
excessive nonzero entries. As a result, M's obtained by SPAI may not approximate A -1 well 
and thus may be ineffective for preconditioning, or it is a good sparse approximate inverse 
of A but some columns of it are denser than they should be. We will justify theoretically 
in subsection 12.11 and confirm numerically in subsection 14. II that the SPAI algorithm may be 
very costly to implement. For PSAI(toZ), we may also suffer the unaffordable overhead from 
solving some possibly large LS problems dU), although it is more likely to construct effective 
preconditioners no matter wether A is regular sparse or not. Remarkably, it turns out that 
the situation mentioned above is improved very substantially if A is regular sparse. With 
reasonable parameters, it is not only possible to implement SPAI and PSAI(ioZ) efficiently 
but also to obtain possibly effective sparse approximate inverses. In Section 2, we will give 
detailed arguments to clarify these claims. 

It is well known [5] that the computational consumption, stability and effectiveness of 
factorized SAI preconditioners are generally sensitive to reorderings of A. Unfortunately, it 
appears that reorderings do not help for SPAI and PSAI(ioZ). The reason is that reorderings 
do not change the irregularity of sparsity patterns of A, A" 1 and good sparse approximate 
inverses of A. Therefore, with orderings, SPAI and PSAI(io£) may still be very costly to 
implement, and SPAI may still be ineffective for preconditioning. We refer the reader to [3] 
for the relevant arguments about SPAI, which are valid for PSAI(toZ) as well. 

Based on our previous statements, we naturally come up with the idea of transforming the 
irregular sparse problem (pTJ) into some regular sparse one(s), on which SPAI and PSAI(toZ) 
work well. It will appear that the Sherman-Morrison- Woodbury formula [1Q|, [T8] is a key 
step towards our goal and provides us a powerful tool. We present an approach to splitting A 
into a regular sparse matrix A and a matrix of low rank s and transforming ([1]) into the s + 1 
new linear systems with the same coefficient matrix A. By exploiting the Sherman-Morrison- 
Woodbury formula, we can recover the solution of ([1]) from the ones of the s + 1 systems 
directly. We consider numerous practical issues on how to obtain a desired splitting of A, 
how to define and compute an approximate solution of ([T]) via those of the new systems and 
how accurately we should solve the new systems. A remarkable merit of this approach is that 
SPAI and PSAI(foZ) can construct effective preconditioners for the new systems efficiently, 
making Krylov solvers converge fast. The price we pay is to solve s + 1 linear systems. But 
a great bonus is that we only need to construct one effective sparse approximate inverse 
M efficiently for the s + 1 systems. The price is generally insignificant as it is typical that 

1 We thank Professor Davis, one of the authors of [7] very much for providing us such a very valuable data 
analysis, clearly showing that irregular sparse linear problems are quite common. 
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the construction of M dominates the whole cost of preconditioned Krylow solvers even in 
a parallel computing environment. As a matter of fact, due to inherent parallelizations of 
SPAI and PSAI(toZ), SAI type precondtioners are attractive for solving a sequence of linear 
systems with the same coefficient matrix, as has been addressed in the literature, e.g., [I]. 
As a consequence, given the fact that irregular sparse linear systems are quite common in 
applications, our approach widely extends the practicality of SPAI and PSAI. 

The paper is organized as follows. In Section 2, we review SPAI and PSAI(toZ) procedures 
and shed light on the above-mentioned features for A irregular sparse and regular sparse, 
respectively. In Section 3, we describe our new approach for solving (UJ with A irregular 
sparse and address some theoretical and practical considerations. In Section 4, we report 
numerical experiments to demonstrate the superiority of our approach to SPAI and PSAI(toZ) 
applied to ([1]) directly and the superiority of PSAI(toZ) to SPAI for both irregular and regular 
sparse linear systems. Finally, we conclude the paper in Section 5. 

2 The SPAI and PSAI(foZ) procedures 

In this section, we give an overview of the SPAI and PSAI(toZ) procedures and shed light on 
the facts that SPAI and PSAl(toZ) are costly, may be impractical to implement and SPAI 
may not be effective for preconditioning if A is irregular sparse. 

2.1 The SPAI procedure 

Denote by Slj the sparsity pattern of after I loops of augmentation starting with a 
given initial pattern si , and by Cu the set of indices of nonzero rows of A(:,SP). Let 
A k = A(£^\ S^), e-k = e k (C k ), fh k be the solution of ((D). Then the residual of ^ is 

r = A(:,sjV)rh k - e k . 

For r ^ 0, denote by C the set of indices I for which r(l) ^ 0, and by N the set of indices of 
nonzero columns of A(C, :). Then 

J = A^Vf (5) 
constitutes the new candidates for augmenting sjp in the next loop. Grote and Huckle [11] 



suggest to select several most profitable indices from J and augment them to sjp to obtain 

a new sparsity pattern S k +1 ^ of m^. They do this as follows: for each j G J , consider the 
one-dimensional minimization problem 

min \\r + HjAej \\, (6) 

whose solution is 

r T Aej 

^■ = -p^P (7) 

where the superscript T denotes the transpose of a vector or matrix. The 2-norm pj of the 
new residual r + (ijAej satisfies 



( r T A ej ) 2 
\Ae^ 2 



Pi = WAV ~ ( 8 ) 



The set sjp of the most profitable indices j consists of those associated with a few, say, 1 
to 5, smallest pj's and is added to to obtain S^ 1 ^ . Update C k to get C k l+1 ^ by adding 
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the set C k of indices of new nonzero rows corresponding to S k . The new augmented LS 
problem (j3D is solved by updating m,\. instead of resolving it. Proceed in such a way until 
||.Amfc — efc|| < 5 or I attains the prescribed maximum loop steps l max , where 5 is a given 
fairly small tolerance, say 0.1 ~ 0.4. 

For a given small maximum loop steps l max , Huckle |13j has established effective upper 
bounds for the pattern of M obtained by SPAI, and shown that the patterns of (A T A) 1 ™** A T , 
(I + \A\ + \A T \) lrn ^A T and (I + A) lmax are good envelop patterns of M. 

Remark 1. Let us consider the computational complexity of SPAI. When A is regular 
sparse, it is straightforward to verify that r = Ami- — e& is also sparse and both C and J have 
only a few elements. As a consequence, the cardinal numbers of s[ s are small, so are the 
orders of A^s in @. Therefore, it is cheap to determine the set sjp of the most profitable 
indices and solve (JU). However, the situation deteriorates severely when A is irregular sparse. 
For example, assume that the k-th column of A is relatively dense, and suppose A{k, k) ^ 
and take S^P' = {k}. Then in the first loop of SPAI, r = mt c (k)ak — &k is as dense as a^, 
causing that C has a very big cardinal number and J has a lot of elements. This situation 
is extended in subsequent loops. As a consequence, suppose that a& is fully dense, at each 
loop we have to compute almost n numbers Pj's, order them and select the most profitable 
indices in J . So SPAI may be very expensive whenever A is irregular sparse. 

Remark 2. For the irregular sparse A, suppose that it has good sparse approximate 
inverses. Then they are typically irregular sparse too. Suppose the k-th column of a good 
sparse approximate inverse is irregular. The description of SPAI shows that if is irregular, 
i.e., relatively dense, then the set J in (0) has big cardinal number during loops. Nevertheless, 
SPAI simply takes the set Sjp to be only a few most profitable indices from J . If the 
cardinality of <§£ is fixed small, say 5, the default value as suggested and used in [TJ [2J [XT] , 
then the fe-th column of M is sparse and may not approximate the k-th. column of A" 1 well 
unless the loop steps Z max is large enough. Since the number of nonzero entries in any single 
column of the final M in SPAI is bounded by the number of most profitable indices per loop 
times the maximum loop steps l max , which is fairly small, say 20, the default used in [H[2]) a 
column of M may have not enough or excessive nonzero entries. As a result, M's obtained by 
SPAI may not approximate A -1 well and thus may be ineffective for preconditioning, or it is 
a good sparse approximate inverse of A but some columns of it are denser than they should 
be. On the other hand, if the cardinality of S k is allowed to vary and instead taken big, the 
resulting M may be a good preconditioner. However, we have to solve some LS problems 
of large sizes, causing that SPAI may be very costly and even impractical to implement. 

2.2 The PSAI(toZ) procedure 

PSAI(to/) is different from SPAI in the way that selects and adjusts S, i.e., <Sfc, k = 1,2, ... ,n. 
We first review the basic PSAI (BPSAI) procedure |15j . From the Cayley-Hamilton theorem, 
A^ 1 can be expressed as a matrix polynomial of A of degree d — 1 with d < n: 



with A = I and Cj (i = 0, 1, . . . , d — 1) being certain constants. Denote V(-) by the sparsity 
pattern of a matrix or vector. It is obvious that V(A~ 1 ) C V((I + A)^ 1 ). For a given small 
positive number l mSuX <C d, the pattern S of the sparse approximate inverse M is taken as a 
subset of V((I + A) 1 ™**) in BPSAI. Therefore, the pattern Sk of the A;-th column nik of M, 



d-l 




(9) 
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^max 

is a subset of (J V(A e k ) since 
1=0 



V({I + A) 1 ^) = |J 



z=o 

The adjustment of S k and the construction of m k for 1 < k < n proceed dynamically 
as follows. For I = 0, 1, ... , l mSLK , denote by s9 the sparsity pattern of m k at the current 
loop step I and by £ k the set of indices of nonzero rows of A(:,S ( k l) ). Set aj, = A l e k . Then 
a^ 1 = ^4a z fe . So the pattern of o^, +1 can be determined. The sparsity pattern <S^ +1 ^ is updated 
as = T~ > ( a k +1 ) U >5fc ■ I n the next loop we solve the augmented LS problem 

min P(4 W) ,sf 1) )m fe (^ 1 ) - e fc ^ +1 )|| 

with >c[,' +1 ^ = £i U Cu , where C, is the set of indices of new nonzero rows corresponding 

to the set V{a l k +1 )\S < f^ of indices of the newly added columns. This problem corresponds to 
the small LS problem Q, whose solution can be updated from m k (S l k ) efficiently Proceed 
in such a way until ||^4mfc — e^H < 5 or / > / ma x- 

It has been proved in [T5J Theorem 1] that for s[ = {k}, as Z max increases, M obtained 
by BPSAI may become denser and denser quickly once one column in A is irregular sparse. In 
order to control the sparsity of M and construct an effective preconditioner, some reasonable 
dropping strategies should be used. Two practical PSAl(lfill) and PSAI(toZ) algorithms have 
been proposed in [15]. PSAI(7/i/Z) aims to retain at most Ifill large entries of m k . To be 
more flexible, Ifill may vary with k. Since the numbers of large entries in the columns of 
A -1 are generally unknown in advance, it is difficult to choose a reasonable Ifill in advance. 
This is a shortcoming similar to SPAI. In contrast, for the newly computed m k at loop Z, 
PSAI(toZ) drops those small entries below a prescribed tolerance tol and retains only large 
ones. Therefore, PSAI(ioZ) is more reasonable and reliable to capture an effective approximate 
sparsity pattern of A~ 1 and determine its large entries. A central issue is the selection of 
dropping tolerance tol. This issue is mathematically nontrivial, and tol has strong effects 
on the effectiveness of PSAI(ioZ) and many other SAI preconditioning procedures. Very 
recently, the authors [13] have established a systematic and solid theory on dropping criteria 
for PSAI(ioZ) and all the static F-norm minimization based SAI preconditioning procedures. 
For PSAI(toZ), the authors show that, at step I < l max , a nonzero entry nij k is dropped for 
1 < j < rt if it satisfies 

\m jk \<tol k = — — , k = 1,2, . . . ,n, (10) 

nnz{m k j\\A\\i 

where nnz(-) is the number of nonzero entries in a vector or matrix and 5 is the stopping 
tolerance for SPAI and BPSAI. We remind that m k in (|10p is the newly computed one at 
loop I. This criterion has been shown to be robust and effective, making M as sparse as 
possible and have similar preconditioning quality to the possibly much denser one obtained 
by BPSAI. More precisely, for the final M obtained by PSAI(foZ), it is proved in [J3] that if 
(|10p is used then the residual norm ||^4mfc — e k \\ < 25, k = 1, 2, . . . , n under the assumption 
that the the residual norm of each column of the preconditioner obtained by BPSAI falls 
below 5. 

The dropping criterion (|10|) also applies to any static F-norm minimization based SAI 
preconditioning procedure [2], where 5 is replaced by 5 k = \\Am k — e k \\ and m k is computed 
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by the static SAI preconditioning procedure without dropping for an a-priori given sparsity 
pattern S k . Write the sparsification of m k as m k in such a way. Then it is known p3] that 
\\Am k - e k \\ < 2S k , k = 1,2, . . . ,n. 

Remark. If A is regular sparse, it is direct to justify that the size of A k is small for a 
small / max and the cost for solving Q is cheap. However, the situation is sharply different 
when A is irregular sparse. Suppose that the k-th column a k of A is relatively dense and 
A{k, k) ^ 0, and let s£ = {A;}. Then the corresponding m k is also relatively dense since 

its pattern = V(a k ) U {k} = V{a k ) at the first loop. Therefore, (@D is a relatively large 
LS problem no matter whether dropping is used or not. Generally, if the number of nonzero 
entries in a k is comparable to n, then we have to solve a LS problem whose size is comparable 
to n, causing that PSAI(ioZ) is impractical. 



3 Transformation of (ED) into regular sparse linear systems 

As we have already seen, when A is irregular sparse, both the SPAI and PSAI(toZ) procedures 
encounter some serious difficulties and M obtained by SPAI may be ineffective for precondi- 
tioning (PQ). In this section, we will attempt to transform (pQ) into some regular sparse ones, 
for which it is possible to use SPAI and PSAI(foZ) to construct effective preconditioners effi- 
ciently. It turns out that the Sherman-Morrison- Woodbury formula will play a central role 
for our purpose. We will address numerous practical issues on the determination of irregular 
columns and the construction of A, etc. 

We now review the Sherman-Morrison-Woodbury formula |18| p. 330]. 

Theorem 1. Let U, V G M. nxs with s < n. If A and A — UV T are nonsingular, then 
I — V A~ lT J is nonsingular and 

(A - UV T )~ l = A~ l + A~ X U(I - V T A" 1 Uy 1 V T A~ 1 . (11) 

In particular, if U = u and V = v are vectors, then 

(A-u^ = A-l + 1 _ vTA _ lu (12) 

If s < n and U and V are of full column rank, the theorem says that a rank s correction to 
A results in a rank s correction to A" 1 . We should comment that the theorem itself requires 
neither of the two conditions. However, the formula is typically of interest for s n, and 
with s = 1 it reduces to the Sherman-Morrison formula. For a good survey on the history 
and some applications, we refer the reader to |12j . 

For our purpose, assume that the ji, j2, . . . , J s -th columns of A are irregular and the 
remaining n — s ones are sparse. Denote by A^c = (aj 1 ,aj 2 , . . . ,aj g ) the matrix consisting of 
the s irregular columns of A, by A^ = (5^, a j2 , . . . , a Js ) the sparsification of A^ that drops 
some of its nonzero entries, so that each column of A& c is as sparse as the other n — s columns 
of A. Define U = A^ — A^ = (u±, u%, . . . , u^. Then U just consists of those dropped nonzero 
entries of A^c- Denote by A the regular sparse matrix that replaces aj i by a,j i , i = 1, 2, . . . , s. 
Then A is split into 

A = A + UV T , (13) 

where V = (e,^, e,- 2 , . . . , e,- s ) and is the the ji-th. column of the n xn identity matrix 
/, i = 1,2, ...,s. Assume that A is nonsingular. Then it follows from Theorem [T] that 
/ + V T A~ lj J is nonsingular and 

A' 1 = A- 1 - A~ l U(I + V T A- 1 U)- 1 V T A~ 1 . (14) 
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Therefore, the solution of (JT|) is 

x = A~ x b = A~ x b - + V T (A- 1 U))- 1 (V T A- 1 b). (15) 

This amounts to solving a new regular sparse linear system 

Ay = b (16) 

and the other s regular sparse linear systems 

Awj = Uj, j = 1,2, ...,s. (17) 

If the exact solutions to (|16p and (|17j) were available, we would compute the solution x of ([1]) 
from (|15p by solving the small s x s linear system with the coefficient matrix / + V T (A~ l U) 
and the right-hand side V T A~ 1 b. This could be done by any numerically stable direct solver 
very cheaply. 

We can summarize the above approach as Procedure *. 

Procedure *: Solving the irregular sparse linear system ([1]) 

1: Find s and A^ c and sparsify A^ to get A^. Define U = — A^ and the regular sparse 

matrix A = A — UV T , where V = (ej 1 ,ej 2 , . . . ,e Js ). Assume that A is nonsingular. 
2: Solve s + 1 linear systems (fT6|) and (fT7|) for y and iwi, u?2, ■ ■ ■ , w s , respectively. 
3: Let W = (wi,W2, ■ ■ ■ ,w s ) and compute the solution x of Ax = b by 

x = A~ 1 b = y -W(I + V T W)-\V T y). (18) 



Remark. If A has only one irregular column and j\ = k, we only need to solve two linear 

systems Ay = b and Aw = u, and (fT8|) reduces to x = j4 _1 6 = y — - i ^^ w w - 

For regular sparse systems (fT6j) and (fT7|) , we suppose that only iterative solvers are viable 
in our context. Now, a big and direct reward is that SPAI and PSAI(to^) can be used much 
more efficiently to construct effective preconditioners for the s + 1 regular sparse systems than 
the original irregular sparse ([T]). So it is expected that the preconditioned Krylov solvers can 
converge quickly. 

In order to use Procedure * to develop a practical iterative solver for ([1]), we need to 
handle several practical issues. Also, we will see that recovering an approximate solution of 
dl]) via those of (fT6|) and (fT7|) is involved and is not as simple as Procedure * indicates, in 
which the exact y and W are assumed. 

The first issue is about the quantitative meaning of irregular columns, by which we define 
Adc and U. Obviously, like sparsity itself and many other quantities in numerical analysis, 
it appears impossible to give a precise definition of it. In fact, it is also unnecessary to do 
so. In our experiments, we empirically find that the threshold Wp is a good choice, where p 
is the average number of nonzero entries per column of A. If the number of nonzero entries 
in a column exceeds lOp, then we regard the column as an irregular one. Based on this 
criterion, we determine all the irregular columns of A and the number s of them. We point 
out that other thresholds ranging from 8p to 15p work equally well. So the efficiency and 
effectiveness of SPAI and PSAI(toZ) is quite insensitive to thresholds. This is very appealing 
for the robustness and generality of our algorithm. 

The second issue is which nonzero entries in A^c should be dropped to get A^ and 
generate a regular matrix A. In principle, the number p of nonzero entries in each column 
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of Ada should be comparable to the average number p of nonzero entries per column of A. 
Given p, there may be many dropping ways. We propose two approaches. The first one is 
to retain the diagonal and p—1 nonzero entries nearest to the diagonal and drop the others 
in each column of A^. The second approach is to retain the diagonal and the other p—1 
largest entries and drop the others in each column of A^c- For the choice of p, to be unique, 
we suppose taking p = p simply. Numerically, two approaches have exhibited very similar 
behavior. We will only report the results obtained by the first approach. 

The third issue is about the sparsity, non-singularity and conditioning of A and the 
existence of good sparse approximate inverses of A. As is clear from its construction, A 
is sparser than A. For the nonsingular A, although we cannot theoretically guarantee its 
non-singularity, A should be nonsingular and the (numerically) singular case must be rare 
in practice. As for the conditioning of A, it is possible to get either a well-conditioned or 
ill-conditioned A. Given that A is ill conditioned, the former is naturally welcome, but we 
should not expect such a luck; the latter is more possible. Theoretically speaking, A may be 
worse or better conditioned than A. Happily, numerical experiments will indicate that A's 
are often, though not always, better conditioned than A's. Actually, what we have found is 
that for 8 out of 12 test matrices, ^4's are considerably better conditioned than ^4's and for 
the other four matrices, the condition numbers of each A and the corresponding A are very 
near. Regarding the existence of good sparse approximate inverses of A, we, informally and 
heuristically, argue as follows: for a given positive integer / maX5 the pattern V((I + A) Zmax ) 
is generally sparser than V((I + A) 1 ™**). Since the sparsity patterns of approximate inverses 
obtained by SPAI and PSAI(toZ) are effectively bounded by V((I+ A)' max ) and V((I+ A) /max ), 
respectively, it is expected that A has good sparse approximate inverses when A does. 

The last important issue is how we should select stopping criteria for Krylov iterations 
for s + 1 linear systems so as to recover an approximate solution of (pQ) with the prescribed 
accuracy. It is seen from (|18p that the solution x of (pQ) is formed from the ones of the s+1 new 
systems. Recall that the s + 1 linear systems are now supposed to be solved approximately by 
preconditioned Krylov solvers. Our concerns are (i) how to define an approximate solution 
x from the s + 1 approximate solutions of (|16|) and (|17|) and (ii) how accurately we should 
solve (fT5j) and (fT7|) such that x satisfies jj^jj = ^ra^ < e - As it appears immediately, it is 
direct to settle down the first concern, but the second concern is involved. 

Theorem 2. Let y and Wj, j = 1, . . . , s be the approximate solutions of Ay = b and Avjj = 
Uj, j = 1, 2, . . . , s, respectively, and define W = (w\, . . . , w s ), ry = b — Ay, r&j = Uj — Awj. 
Assume that I + V T W is nonsingular with V = (ej 1 , ej 2 , . . . , ej s ), and define c = + 



F T W)- 1 (F T j/)||. Let 



x = y- W(I + V T W)~ l {V T y) 



(19) 



be an approximate solution o/([T]). Then if 




s 



(20) 



and 




e, j = 1,2,... ,s 




we have 




b- Ax 




< e. 
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Proof. Replacing W and y by their approximations W and y in (|18|) . we get (|19fl . which is 
naturally an approximate solution of (H|). Define = U — AW. We obtain 

r = b - Ax = b - Ay + AW (I + V T W0 _1 (V T ?/) 
= 6 - (I + C/y T )y + (A+ UV T )W{I + V^J-^V^) 

zjy T i/ + uv T w(i + v^wo -1 ^!/) + Aw (i + y T #) _1 (y T y) 
[/(/ - v T w{i + y T w)- 1 )(y T y) + iW(J + v T w)-\v T y) 

U(I - (/ + F T W^ - /)(/ + V T W)- l ){V T y) + AW {I + F T W) _1 (F T y) 
[7(1 + V T W)' 1 (V T y) + AW"(7 + ^W) -1 ^^) 
R w (I + V T W)-\V T y), 



r 



from which it follows that 



< ||rjf|| + cH-R^H < \\ry\\ + cII^Hf- (23) 



By definition and R^(:,j) = r^, we have H-R^Jf = yjYfj=i ll r iij 2 - If © and 

1 1 ^ilj 1 1 s 

-jiljr- < 7T^' J = 1,2,... , a 
|o|| 2ysc 

(|23p means that (|22p holds. Since the above relation is just (|2ip . the theorem holds. □ 

We point out that because of (|23p our stopping criterion (|2ip may be conservative. Fur- 
thermore, we comment that c is moderate if I + V T W is well conditioned and it may be large 
if I + V T W is ill conditioned. Since s is supposed very small, this theorem indicates that we 
should solve the s linear systems (|17p roughly with the accuracy at the level of e. We may 
solve them by Krylov solvers either simultaneously in the parallel environment or indepen- 
dently in the sequential environment. An alternative approach is to solve them using block 
Krylov solvers for very small s. Note that c cannot be computed until iterations for (|16|) and 
(fT7|) terminate but the stopping criterion (f2T|) depends on c. Therefore, the computation of 
c and termination of iterations interacts. In implementations, we simply replace c by one in 
(|2ip and stop iterative solvers for the s systems (|17p with the modified accuracy requirement. 
Because of this inaccuracy, (|22p may fail to meet but ||r||/||6|| should be at the level of e 
generally. In later numerical experiments, we will find that c = 1 works very well and makes 
(|22p hold for almost all the test problems and the right-hand sides of (|22p are only a little 
bit bigger than e in the very rare cases where (|22p does not meet. 



4 Numerical experiments 

In this section, we test our approach and compare it with the algorithm that preconditions ([I]) 
by SPAI and PSAI(toZ) directly and then uses Krylov iterative methods to solve the resulting 
preconditioned systems. We report the numerical experiments obtained by the Biconjugate 
Gradient Stablized (BiCGStab) with SPAI and PSAI(foZ) preconditioning on ([T]) and (fTUj) . 
(|17p . respectively. Such combinations give rise to four algorithms. For brevity, we do not 
write BiCGStab explicitly and them Standard-SPAI, New-SPAI, and Standard-PSAI(toZ) and 
New-PSAI(>0, abbreviated as S-SPAI, N-SPAI and S-PSAI(toZ), N-PSAI(toZ), respectively. 
We will first demonstrate the very considerable superiority of N-SPAI to S-SPAI and that 
of N-PSAI(toZ) to S-PSAI(toZ). Then we compare the performance of PSAI(toZ) and SPAI. 
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We will illustrate that if the numbers of nonzero entries of preconditioners are required to 
be (almost) the same then PSAI(toZ) is more effective for preconditioning than SPAI for 
both irregular and regular sparse linear systems. Under such restriction to the sparsity of 
preconditioners, we are able to compare the preconditioning effectiveness of PSAI(to^) and 
SPAI. We will show that PSAI(ioZ) captures the positions of large entries in A" 1 and A" 1 
more effectively than SPAI and thus generate better precondiotioners. For other Krylov 
solvers, such as BiCG, CGS and the restarted GMRES(20), we have obtained analogous 
numerical results, from which the same conclusions can be drawn. So it suffices to only 
report and evaluate the results obtained by BiCGStab. 

Before testing our approach, we look into all the matrices in [7] and give illustrative 
information on where irregular sparse linear systems are from, how common they are in 
practice, how big s can be and how dense irregular columns. We divide matrices into their 
problem domains, sort them by percentage of matrices in that domain that are irregular. For 
example, there are 35 total semiconductor problems in the collection, of which 13 are regular 
and 22 irregular. Matrices labeled as "graphs" in the collection are excluded (many of those 
are irregular, but not all are a linear system). Here are details: 

• 100%: regular frequency-domain circuit simulation problem; 4 irregular frequency- 
domain circuit simulation problems 

• 100%: regular linear programming problem; 1 irregular linear programming problem 

• 63%: 13 regular semiconductor device problems; 22 irregular semiconductor device 
problems 

• 56%: 26 regular power network problems; 33 irregular power network problems 

• 51%: 124 regular circuit simulation problems; 132 irregular circuit simulation problems 

• 61%: 53 regular optimization problems; 82 irregular optimization problems 

• 33%: 2 regular computer graphics/vision problems; 1 irregular computer graphics/vision 
problem 

• 33%: 44 regular economic problems; 21 irregular economic problems 

• 25%: 6 regular counter-example problems; 2 irregular counter-example problems 

• 20%: 28 regular eigenvalue/model reduction problems; 7 irregular eigenvalue/model 
reduction problems 

• 11%: 25 regular material problems; 3 irregular material problems 

• 11%: 62 regular chemical process simulation problems; 8 irregular chemical process 
simulation problems 

• 11%: 8 regular statistical/mathematical problems; 1 irregular statistical/mathematical 
problems 

• 11%: 54 regular theoretical/quantum chemistry problems; 7 irregular theoretical/quantum 
chemistry problems 

• 10%: 118 regular 2D/3D problems; 13 irregular 2D/3D problems 

• 8%: 12 regular acoustics problems; 1 irregular acoustics problem 
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• 5%: 287 regular structural problems; 14 irregular structural problems 

• 3%: 28 regular combinatorial problems; 1 irregular combinatorial problem 

• 3%: 167 regular computational fluid dynamics problems; 5 irregular computational fluid 
dynamics problems 

• 3%: 30 regular thermal problems; 1 irregular thermal problem 

• 2%: 49 regular electromagnetic problems; 1 irregular electromagnetic problem 

• 0%: 2 regular least squares problems; irregular least squares problem 

• 0%: 47 regular model reduction problems; irregular model reduction problem 

• 0%: 4 regular other problems; irregular other problem 

• 0%: 2 regular random 2D/3D problems; irregular random 2D/3D problem 

• 0%: 3 regular robotics problems; irregular robotics problem 

The above data illustrates that irregular sparse Ax = b's are quite common and come 
from many applications. Below let us give more details on the information about the size of 
s and how dense irregular columns are. Figure Q] depicts what we are concerned with. In the 
top left plot, the x axis is the order of a square matrix. The y axis is the number s of dense 
columns. A circle is a matrix in the collection. The steep line is s = n, which is not actually 
achievable, and the flat line is s = y/n. This figure only plots matrices with at least one 
dense column. In the top right figure, each dot is a square matrix. The x axis is the mean, 
i.e., the average number, of nonzero entries in each column, which equals nnz(A)/n. The y 
axis is the maximum number of nonzero entries in any column divided by the mean for that 
matrix. The line parallel to the x axis is y = 10. Matrices have at least one dense column if 
they reside above the line y = 10. For each x, the bigger y, the denser the irregular column. 
The bottom 2 figures are the same, but with social networks and other graphs excluded, d 

These figures further demonstate that there are very dense columns for many matrices, 
irregular sparse matrices are common and s can be quite big. 

We test our approach on some of the above irregular sparse linear systems. A brief de- 
scription is presented in Table [TJ The right-hand side b of Ax = b was formed by taking 
the solution x = (1,1,..., 1) T . Taking the initial approximate solution to be zero for each 
problem and e = 10~ 8 in stopping criterion (|22p . we have found that BiCGStab without 
preconditioning did not converge for all the test problems within 500 iterations. We note 
that cbuckle is symmetric. So it should be better to design preconditioners to maintain the 
symmetry of preconditioned matrices, which can be achieved by using factorized or splitting 
preconditioners, so that Krylov solvers for symmetric problems can be applied. In our ex- 
periments, however, we used cbuckle in PSAI(toZ) purely for test propose and treated it as a 
general matrix. 

We conducted the numerical experiments on an Intel (R) Core (TM)2 Quad CPU E8400 
@ 3.00GHz with main memory 2 GB under the Linux operating system. The computations 
were done using Matlab 7.8.0 with the machine precision e mac h = 2.22 x 10~ 16 , and SPAI 
preconditioners were constructed by the SPAI 3.2 package [2] of Barnard, Broker, Grote 
and Hagemann, which is written in C/MPI. We used both SPAI and PSAI(ioZ) as right- 
preconditioning and took the initial patterns s£ = {k}, k = 1,2, ... ,n. We took c = 1 in 

2 The figures and the previous data analysis are due to Professor Davis, and we thank him very much. 
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Figure 1: Illustrations of irregular matrices in [7|. 



(|2ip and stopped iterations when (|20p and (|21[) were satisfied with e = 10 -8 or 500 iterations 
were reached. The initial approximate solution for each problem was zero vector. With x 
defined by (fT9|) . we computed the actual relative residual norm 

116 — Ax\\ . ,. 

rr = T (24> 

and compared it with the required accuracy e = 10 -8 . 

Before performing our algorithms, we carried out row Dulmage-Mendelsohn permutations 
[HJ [16] on the matrices whose diagonals have zero elements, so as to make their diagonals 
nonzero whenever necessary. The related Matlab commands are j = dmperm(A) and A = 
A(j, :). We applied dmperm to rajat04, rajatl2, tols4000 and ASIC_100k. The threshold for 
defining an irregular column is lOp. For all the matrices listed in Table [IJ A is constructed by 
retaining the diagonal entry and p—1 nonzero entries nearest to the diagonal and dropping 
the others in each of the irregular columns of A, and U is composed of the dropped nonzero 
entries. Table [2] shows some useful information about each A and A, including the order n, 
the number s of irregular columns in A, the average number p a of nonzero entries per column 
of A, the number p^ of nonzero entries in the densest column of A^c, the numbers nnz(-) of 
nonzero entries of A and A and the 2-norm condition numbers or their 1-norm estimates of 
A and A. 

We have some informative observations from Table [2j As is seen, except cbuckle and 



13 



Table 1: The description of test matrices, where k(A) = \\A\\ \\ A -1 || , 'sym' denotes the numeric 
value symmetry and we used the Matlab function condest to estimate the 1-norm condition 
numbers of the latter seven larger matrices. 



TV /T i " 

Matrix 


n 


nnz 


k(A) 


sym 


Description 


fs_541_3 


541 


4282 


2.83 x 10 


No 


2D/3D problem 


fs_541_4 


541 


4273 


1.17 x 10 


No 


2D/3D problem 


rajat04 


1041 


8725 


1.64 x 10 5 


No 


circuit simulation problem 


raj at 12 


10 IV 


lzolo 


o.yi x iu 


1NO 


circuit simulation problem 


tols4000 


4000 


8784 


2.36 x 10 7 


No 


computational fluid dynamics problem 


cbuckle 


13681 


676515 


3.30 x 10 7 


Yes 


structural problem 


ASIC_100k 


99340 


940621 


1.46 x 10 11 


No 


circuit simulation problem 


del 


116835 


766396 


1.01 x 10 10 


No 


circuit simulation problem 


dc2 


116835 


766396 


8.86 x 10 9 


No 


circuit simulation problem 


dc3 


116835 


766396 


1.16 x 10 10 


No 


circuit simulation problem 


trans4 


116835 


749800 


3.30 x 10 9 


No 


circuit simulation problem 


trans5 


116835 


749800 


2.32 x 10 9 


No 


circuit simulation problem 



tols4000, all the other test matrices have some almost fully dense columns. The matrix 
cbuckle and tols4000 are not so bad, but they indeed have one and 18 irregular columns under 
our definition. Precisely, except cbuckle and tols, raja04 and rajal2 have some columns that 
have more than n/2 nonzero entries, and all the other matrices have some fully dense columns. 
Table [2] shows that each A has a few irregular columns. The biggest two percentages 0.45% 
and 0.12% , which are equal to s/n, are reached for tols4000 and ASIC_100k, where s = 22 
and 122, respectively. It is remarkable that the eight ^4's are considerably better conditioned 
than the corresponding A's, where the condition numbers of ^4's are reduced by roughly one 
to five orders, compared with those of A's. For the other three matrices rajat04, rajatl2 and 
cbuckle, ^4's and ^4's have very near condition numbers. 

We next investigate the patterns of large entries in the "exact" inverses of an irregular 
sparse matrix and the regular sparse matrix induced from it. We only take rajat04 as an 
example. After performing row Dulmage-Mendelsohn permutation on it, we use the Matlab 
function inv to compute A -1 and A -1 and then drop their nonzero entries whose magnitudes 
fall below 10 -3 . We depict the patterns of sparsified A -1 and A -1 as (A) and (B) in Figure[2l 
respectively. It is clear that good approximate inverses of A and A are sparse but there are 
several dense columns in (A), which means that an effective sparse approximate inverse of A 
is irregular. On the other hand, the situation is improved substantially in (B), from which 
it is seen that a good sparse approximate inverse of A is regular sparse and it is sparser 
than the matrix in (A). These results typically demonstrate that good sparse approximate 
inverses of an irregular sparse matrix are also irregular but in contrast a regular sparse matrix 
has regular sparse approximate inverses. We have checked several other test matrices and 
have had the same findings. Perceptually, all of these have strongly supported our empirical 
assertions that the good sparse approximate inverses of an irregular or regular sparse matrix 
are very possibly irregular or regular sparse too. 

Here and hereafter, in all the tables, pt and st are the total CPU timings (in seconds) 
of constructing M and solving the preconditioned linear systems by BiCGStab, respectively, 
spar = nnz(M)/nnz(A) or nnz(M)/nnz(A) denotes the sparsity of M relative to A or A, 
and iter stands for the iteration number that BiCGStab used for ([1]) and the maximum of 
iteration numbers that BiCGStab used for the s + 1 systems (fl"6|) and (fT7|) . respectively. The 
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Table 2: Some information about A and A, where we used the Matlab function condest to 
estimate the 1-norm condition numbers of the latter seven larger matrices. 





n 


s 


T)„ 
Fa 


T)J 


TITLZ (^4) 


TITLZ (^4) 


k(A) 


k(A) 


fs_541_3 


541 


1 


7 


538 


4282 


3745 


2.83 x 10 11 


7.86 x 10 B 


fs_541_4 


541 


1 


7 


535 


4273 


3739 


1.17 x 10 1U 


1.64 x 10 s 


rajat04 


1041 


4 


8 


642 


8725 


7306 


1.64 x 10 s 


1.16 x 10 8 


raj at 12 


1879 


7 


6 


1195 


12818 


9876 


6.91 x 10 5 


6.53 x 10 5 


tols4000 


4000 


18 


2 


22 


8784 


8424 


2.36 x 10 7 


2.36 x 10 7 


chuckle 


13681 


1 


49 


600 


676515 


675916 


3.30 x 10 Y 


8.06 x 10 Y 


ASICUOOk 


99340 


122 


9 


92258 


940621 


742736 


1.46 x 10 11 


9.28 x 10 y 


del 


116835 


55 


6 


114174 


766396 


595757 


1.01 x 10 10 


2.17 x 10 8 


dc2 


116835 


55 


6 


114174 


766396 


595757 


8.86 x 10 y 


5.81 x 10 7 


dc3 


116835 


55 


6 


114174 


766396 


595757 


1.16 x 10 iU 


1.52 x 10 8 


trans4 


116835 


55 


6 


114174 


749800 


587459 


3.30 x 10 y 


3.46 x 10 8 


trans5 


116835 


55 


6 


114174 


749800 


587459 


2.32 x 10 9 


6.54 x 10 Y 



actual residual norm rr defined by (|24p is a ■ e, and we only list the multiple a in the tables. 
So a < 1 indicates that BiCGStab converged with the prescribed accuracy e. The size of a 
reflects wether taking c = 1 in (|21|) is reliable or not. 

4.1 Numerical results obtained by SPAI 

We take 5 = 0.4 as the stopping criterion for SPAI. Since effective sparsity patterns of A 
and A" 1 are generally unknown in advance, some key parameters, especially the number 
of most profitable indices per loop, involved in Procedure * can only be chosen empirically 
in order to control the sparsity of M and simultaneously to make M achieve the desired 
accuracy 5 as far as possible. In the experiments of this subsection, we fix the number of 
most profitable indices to be five per loop, which is also the default value in the code. The 
maximum number of loop steps £ max is set as twenty, bigger than the default value five in the 
code. The SPAI terminated whenever ||Amfc — e^H < 8, k = 1, 2, . . . , n or the the maximum 
loop steps l max was attained. It is run similarly for A. With the given parameters and the 
initial pattern = {k}, k = 1,2, ... ,n, the number of nonzero entries per column in the 
final M is bounded by 1 + 5 x 19 = 96. So if good preconditioners are irregular sparse, 
then SPAI may not be effective for preconditioning. We mention that we have omitted the 
symmetric matrix cbuckle since the matrix input in the SPAI 3.2 package is only supported 
by "real general" Matrix Market coordinate format. Table [3] lists the results. 

We make some comments on the results. First, we look at the efficiency of constructing 
M's by S-SPAI and N-SPAI. The notation V for the last six larger matrices indicates that 
S-SPAI could not compute M's within 100 hours because of the irregular sparsity of A's. 
Since each of these six matrices has fully dense irregular columns, the unaffordable time 
consumption results from the large cardinal numbers of JT's and Cs in Section 12.11 More 
precisely, for a fully dense irregular column of A, we have to compute almost n numbers /ij's 
in ([7]), sort almost n indices in C and select five most profitable indices among them at each 
loop. Carrying out these tasks is very time consuming. In contrast, N-SPAI did a very good 
job due to the regular sparsity of A 7 s. For the other matrices and given the parameters, 
the two M's obtained by S-SPAI and N-SPAI have very similar sparsity, but it is seen from 
ptfs that N-SPAI can be considerably more efficient to compute M's than S-SPAI, and the 
former may make great improvements over the latter and can be several times faster, e.g., six 
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1000 800 600 400 200 1000 800 600 400 200 

(A): nnz = 32340 (B): nnz = 2401 7 



Figure 2: rajat04: (A) and (B) are the patterns of sparsified A 1 and A , respectively. 

and seven times faster for raja04 and rajal2, respectively. For the last six matrices, N-SPAI 
computed all the M's within no more than half an hour to two hours, drastic improvements 
over S-SPAI! 

Second, we investigate the effectiveness of M's for preconditioning. To be more illuminat- 
ing, we recorded the number n c of columns in M which do not satisfy the accuracy 5 = 0.4 
for each matrix in S-SPAI or N-SPAI. We have observed that for the first four matrices, n c 's 
produced by N-SPAI are always smaller than those by S-SPAI. To some extent, this illustrates 
that it is more difficult for SPAI to compute an effective preconditioner when A is irregu- 
lar sparse. Indeed, as indicated by iter's, it is seen that for the first four matrices N-SPAI 
is more effective than S-SPAI. This confirms our claim that SPAI may be less effective for 
preconditioning irregular sparse linear systems. 

Third, we have found from a's that most of all the actual relative residual norms rr's in 
(J21I) dropped below e. The case that a > 1 happened only for ASIC_100k, but the actual 
relative residual norm rr = 1.37 x 10 -8 is already very near to the required accuracy e = 1CP 8 , 
indicating that the algorithm converged essentially. This means that taking c = 1 in (|21|) 
worked very well in practice. Note that iter for N-SPAI simply denotes the maximum of the 
iterations that BiCGStab used for s + 1 preconditioned linear systems. Therefore, it indicates 
the worst case but cannot tell us the convergence behavior of BiCGStab for each of the s + 1 
systems. Actually, for del, we have observed that only two of the s systems (fT7|) needed 
considerably more iterations than the others. The reason is that the two ||i>||/||zij||'s involved 
in the right-hand side of (|2ip are at least one order smaller than those of the remaining 
systems, so that more iterations were needed. 

Finally, we compare the overall performance of N-SPAI and S-SPAI. Naturally, for the 
last six larger matrices, S-SPAI failed to compute M's within 100 hours while N-SPAI was 
very efficient to do the same job and exhibited a very substantial superiority. For the other 
problems, N-SPAI is two to eight times as fast as S-SPAI to compute M's. We see that the 
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Table 3: Numerical results obtained by SPAI 





S-SPAI 


N-SPAI 




pt 


st 


spar 


iter 


a 


n c 


pt 


st 


spar 


iter 


a 


n c 


s 


fs_541_3 


0.33 


0.01 


1.34 


21 


0.40 


1 


0.17 


0.01 


1.52 


8 


0.62 





1 


fs_541_4 


0.14 


0.01 


0.81 


14 


0.01 


1 


0.04 


0.01 


0.91 


6 


0.68 





1 


raiat04 


1.17 


0.01 


0.36 


31 


0.20 


6 


0.16 


0.02 


0.40 


17 


0.57 


2 


4 


raj at 12 


3.19 


0.01 


0.86 


48 


0.31 


3 


0.39 


0.08 


1.08 


39 


0.80 





7 


tols4000 


0.10 


0.02 


1.01 


3 


0.44 


6 


0.05 


0.02 


1.02 


4 


0.48 


6 


18 


ASIC.lOOk 








* 






1619 


13.52 


0.66 


10 


1.37 


12 


122 


del 








* 






8216 


44.07 


1.71 


351 


0.61 


1 


55 


dc2 






* 




* 




6210 


41.75 


1.62 


76 


0.75 





55 


dc3 








* 






5243 


47.05 


1.63 


98 


0.91 


9 


55 


trans4 














3798 


9.13 


1.64 


39 


0.49 





55 


trans5 








* 






3546 


20.79 


1.57 


84 


0.65 





55 



construction of M's by N-SPAI dominates the total cost of our approach and the efficiency 
of N-SPAI compensates for the price of solving s + 1 linear systems. As a result, as far as 
the overall efficiency is concerned, our approach has exhibited a great superiority to SPAI 
applied to precondition (JTJ) directly. 

4.2 Numerical results obtained by PSAI(toZ) 

We look into the performance of N-PSAI(toZ) and S-PSAI(toZ) and show that the former is 
much better than the latter. In PSAI(toZ), we always take 5 = 0.4 and Z max = 10 to control 
the sparsity and quality of M for both S-PSAI(ioZ) and N-PSAI(ioZ). PS AI (to/) terminates 
when ||.Amfc — ejfc|| < 6 or the loop steps I > Z max - We use (fTU|) as the dropping tolerances. 
We test all the matrices in Table [T] and report the results in Table [U where l m is the actual 
maximum loop steps used for computing all the columns rrik, fc = 1,2,... ,n of M. 



Table 4: Numerical results obtained by PSAI(toZ) 





S-PSAI(foZ) 


N-PSAI(foZ) 




pt 


st 


spar 


iter 


a 




pt 


st 


spar 


iter 


a 




s 


fs.541.3 


0.36 


0.01 


1.55 


6 


0.31 


3 


0.29 


0.01 


1.63 


6 


0.30 


4 


1 


fs.541.4 


0.30 


0.01 


1.35 


5 


0.22 


3 


0.25 


0.01 


1.38 


5 


0.42 


3 


1 


rajat04 


1.27 


0.01 


0.72 


11 


0.79 


4 


0.28 


0.02 


0.51 


11 


0.58 


4 


4 


raj at 12 


3.48 


0.01 


2.22 


32 


0.74 


2 


1.40 


0.10 


1.92 


44 


0.56 


2 


7 


cbuckle 


3841 


1.62 


3.41 


85 


0.21 


5 


2322 


2.12 


2.76 


71 


0.55 


5 


1 


tols4000 


0.94 


0.01 


0.97 


2 


0.11 


1 


0.79 


0.01 


1.00 


2 


0.23 


2 


18 


ASIC.lOOk 














1429 


10.05 


0.81 


6 


0.78 


2 


122 


del 














3203 


31.36 


1.54 


201 


0.81 


3 


55 


dc2 














2975 


27.33 


1.62 


59 


0.58 


3 


55 


dc3 














2966 


24.10 


1.63 


56 


0.61 


8 


55 


trans4 














3073 


5.61 


1.79 


31 


0.44 


4 


55 


trans5 














2856 


9.86 


1.61 


58 


1.29 


4 


55 



The notation '-' for the last six larger matrices indicates that our computer was out of 
memory when constructing each M. The cause is that each A of them has some fully dense 
irregular columns, which result in some large LS problems So, PSAI(toZ) may encounter 
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an essential difficulty when A has some very dense columns. For all the other smaller matrices, 
S-PSAI(ioZ) can generate M's with the desired accuracy 5 = 0.4. Furthermore, we have 
seen that S-PSAI(foZ) always used nearly the same Z m as N-PSAI(ioZ) for each A but it is 
considerably more time consuming than N-PSAI(foZ). So, PSAI(ioZ) can capture effective 
sparse patterns of A" 1 and A" 1 for (almost) the same small loop steps I and determine 
large entries of them reliably. This is a distinctive feature that S-SPAI and N-SPAI do not 
share, where SPAI may be ineffective for preconditioning when A is irregular sparse and the 
situation is improved when A is regular sparse. 

We highlight more on the effectiveness for preconditioning. As we have seen from Table HI 
S-PSAI(ioZ) and N-PSAI(toZ) provide comparably effective preconditioners for the first five 
matrices since BiCGStab used comparable iterations iter's to achieve the convergence. This 
is expected, as M's obtained by S-PSAI(ioZ) have similar accuracy to those obtained by 
N-PSAI(toZ) correspondingly. But we should be aware that, for the same accuracy 5, S- 
PSAI(foZ) can generate an M that is considerably denser than that obtained by N-PSAI(toZ); 
see, e.g., cbuckle. This shows that good sparse approximate inverses of an irregular sparse 
matrix A is denser that those of A but PSAI(foZ) can capture effective sparsity patterns of 
A" 1 . This is an advantage of PSAI(toZ) over SPAI for A irregular sparse. 

We comment that taking c = 1 works very well and makes almost all the actual relative 
residual norms defined by ()24|) drop e. The only exception is for trans5, for which a = 1.29. 
But such a indicates that the actual relative residual norm for this problem is very near to 
e, so we can well accept the approximate solution as essentially converged. 

Finally, we are concerned with the overall performance of the algorithms used. The table 
clearly shows that the performance of N-PSAI(ioZ) is very superior to S-PSAI(ioZ), in terms 
of the total computational time that is the sum of pt and st. Since S-PSAI(ioZ) and N- 
PSAI(toZ) provide equally effective preconditioners with comparable sparsity for each A and 
A, it is natural that the solving time st of N-PSAI(foZ) is more than that of S-PSAI(ioZ) by 
noting that N-PSAI(toZ) solves the s + 1 linear systems. Even so, however, the time st of 
solving s + 1 systems is negligible to pt for each problem. 

4.3 Effectiveness comparison of S-SPAI and S-PSAI(toZ) and that of N- 
SPAI and N-PSAI(toZ) 

We attempt to give some comparison of the effectiveness of S-SPAI and S-PSAI(toZ) and 
that of N-SPAI and N-PSAI(toZ). To do so, we take the sparsity as a reference. As we have 
known, all M's in Table 0] satisfy the desired accuracy 5 = 0.4. Now, for each A, we adjust 
the number of most profitable indices per loop and the maximum loop steps Z max , so that 
N-SPAI generates an M whose sparsity is approximately equal to that of the corresponding 
preconditioner obtained by N-PSAI(toZ) in Tabled) We then perform S-SPAI with the same 
parameters to compute a sparse approximate inverse of A's. We aim to show that, given the 
(almost) same sparsity of preconditioners, N-PSAI(toZ) may capture sparsity patterns more 
effectively and produces better preconditioners than N-SPAI does. The same conclusions 
hold for S-PSAI(toZ) and S-SPAI. 

To make each M by N-SPAI be (almost) equally sparse as that obtained by N-PSAI(ioZ) 
for each A, we take the parameters in the SPAI 3.2 code as 

fs_541_3 '-mn 6 -ns 15' 

fs_541_4 '-mn 6 -ns 20' 

rajat04 '-mn 8 -ns 20' 

rajatl2 '-mn 12 -ns 10' 

tols4000 '-mn 5 -ns 20' 
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Table 5: Numerical results obtained by SPAI 





S-SPAI 


N-SPAI 






st 
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st 


SpQiT 




CI 


Q 


fs_541_3 
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01 
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n 1 5 

\J • XtJ 


01 


96 
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X (J 


1 1 


03 


01 


94 


7 


34 




L c\i\ < I I vy j: 


1.32 


01 


56 


20 


99 


1 fi 


02 


50 


12 


77 


4 


Vf\ ~\£\t 1 V, 

L CXi 1 CXiXj J. _ 


2.71 


0.02 


2.05 


42 


0.39 


0.42 


0.09 


1.99 


29 


0.67 


7 


tols4000 


0.10 


0.02 


1.01 


3 


0.44 


0.05 


0.02 


1.02 


4 


0.48 


18 


ASIC_100k 


78h 


10.16 


0.65 


500 


1560 


1807 


12.61 


0.76 


10 


0.92 


122 


del 


* 


* 


* 


* 


* 


7275 


88.25 


1.66 


500 


0.73 


55 


dc2 


* 


* 


* 


* 


* 


6208 


54.33 


1.62 


90 


0.57 


55 


dc3 


* 


* 


* 


* 


* 


5198 


70.63 


1.63 


422 


1.36 


55 


trans4 


* 


* 


* 


* 


* 


3571 


9.23 


1.82 


40 


0.47 


55 


trans5 


* 


* 


* 


* 


* 


3033 


19.52 


1.71 


81 


0.82 


55 



ASIC_100k '-mn 6 -ns 2' 
del '-mn 5 -ns 7' 
dc2 '-mn 5 -ns 7' 
dc3 '-mn 5 -ns7' 
trans4 '-mn 6 -ns 15' 
trans5 '-mn 6 -ns 7' 

where ns = / max in our notation with '-ns j ' denoting ns = j and mn is the number of most 
profitable indices per loop with '-mn j' denoting mn = j. Table [5] reports the results. 

Based on Tables HH3 we can compare the effectiveness of S-SPAI and S-PSAI(toZ) and 
that of N-SPAI and N-PSAI(to0, respectively. 

Obviously, N-PSAI(toZ) improves on N-SPAI for all the test matrices but A resulting from 
rajatl2. The results on the last six larger matrices are more illustrative, where PSAI(toZ) 
exhibited the considerable superiority to SPAI for regular sparse linear systems. Particularly, 
for del, when N-SPAI is applied, BiCGStab used just 500 iterations to achieve the stopping 
criterion (HH) with c = 1, while N-PSAI(toZ) only used 200 iterations; for dc3, N-PSAI(foZ) 
produced a much more effective preconditioner than N-SPAI, and BiCGStab preconditioned 
by N-PSAI(foZ) is seven times faster than that by N-SPAI. The cause is just that given almost 
the same sparsity, PSAI(toZ) better captures effective sparsity patterns, i.e., the positions of 
large entries of A" 1 and A" 1 than SPAI does. This confirms our arguments in the introduction 
and US]. 

For the direct application to the original irregular sparse (pQ), S-PSAI(toZ) shows greater 
improvements over S-SPAI. For ASIC_100k, S-SPAI consumed 78 hours to construct a sparse 
approximate inverse M. But M is of poor quality and BiCGStab failed to converge after 
500 iterations with the actual relative residual norm rr = 1.56 x 10 -3 . The reason should 
be that some columns of M are too sparse to capture enough positions of the large entries 
in the corresponding columns of A" 1 . For the first four matrices, we see from Tables 0HS] 
that two M's for each A have very comparable sparsity but the results clearly illustrate that, 
with a comparable sparsity, S-PSAI(toZ) is considerably more effective for preconditioning 
than S-SPAI for the four matrices. It is eight times, three times, twice and nearly one and a 
half times as fast as S-SPAI for the four problems, respectively, as the corresponding iter's 
indicate. So S-PSAI(ioZ) results in a more substantial acceleration of BiCGStab than S- 
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SPAI. This demonstrates that PSAI(toZ) captures sparsity patterns of A" 1 more effectively 
than SPAI for A irregular sparse and compute the large entries of A -1 reliably, thus leading 
to more effective preconditioners than SPAI when A is irregular sparse. 

Summarizing the above, we conclude that PSAI(toZ) itself is effective for preconditioning 
no matter whether a matrix is regular sparse or not while SPAI appears to be more suit- 
able for regular sparse matrices and may be ineffective when A is irregular sparse. Even for 
regular sparse linear systems, PSAI(ioZ) can outperform SPAI considerably for precondition- 
ing. Taking the construction cost of preconditioners by SPAI and PSAI(toZ) into account, 
to make them computationally practical, we should apply them to regular sparse linear sys- 
tems. Therefore, for an irregular sparse linear system, a good means is to transform it into 
some regular sparse problems, so that SPAI and PSAI(toZ) are relatively efficient to compute 
effective sparse approximate inverses. 

As a last note, we make some comments on the efficiency of SPAI and PSAI(ioZ). Since 
they are different procedures that are derived from different principles and have different 
features, the computational complexity of each of them is quite involved, and the efficiency 
depends on several factors including pattern of A itself. It appears very hard, if not impossible, 
to analyze and compare their costs. Hence we cannot draw any definitive conclusion on the 
efficiency. However, it is seen from Tables [3H5] that for given reasonably parameters N- 
PSAI(toZ) was at least comparable to N-SPAI in efficiency and consumed less CPU time 
to construct M's. As far as the computational efficiency is concerned, there must be cases 
where one procedure wins the other and vice versa. Here we should point out that our 
PSAI(toZ) is written in the Matlab language in single processor while SPAI 3.2 is more or 
less optimized and better programmed in C/MPI designed for parallel computers. So, the 
SPAI 3.2 code itself is better programming than PSAI(ioZ) written in Matlab even in the 
sequential environment. 

A parallel PSAI(ioZ) code in C/MPI or Fortran is involved and will be left as our future 
work. We will expect that the performance of PSAI(toZ) is improved greatly. 

Finally, we point out that our numerical results are for the threshold lOp, by which we 
define irregular columns and regard a column as irregular if the number of its nonzero entries 
exceeds Wp. This threshold is empirical and can be flexible. Actually, we have found that 
our approach is quite insensitive to it. A threshold between 8p ~ 15p works effectively too, 
that is, the resulting A's have similar features as before, and we can obtain similar results by 
using N-SPAI and N-PSAI(ioZ) and draw similar conclusions on them. 

5 Conclusions 

The SPAI and PSAI(toZ) algorithms are quite effective in accelerating Krylov subspace solvers 
for a wide range of real world problems. However, the situation is rather disappointing for 
quite common irregular sparse linear systems. In this case, none of SPAI and PSAI(toZ) 
works well generally due to the very high cost and/or possibly excessive storage requirement 
of constructing preconditioners and/or the ineffectiveness of preconditioners. However, for a 
regular sparse matrix, we have shown that it is possible to use SPAI and PSAI(toZ) to con- 
struct effective preconditioners efficiently. Motivated by this crucial feature and exploiting 
the Sherman-Morrison-Woodbury formula, we have transformed the original irregular sparse 
linear system into some regular sparse ones, for which SPAI and PSAI(toZ) are practical. We 
have considered some practical issues and derived stopping criteria for iterative solutions of 
new systems, so that the approximate solution of the original problem achieves the user pre- 
scribed accuracy. In such a way, we have extended the applicability of SPAI and PSAI(foZ) 
very significantly. Numerical experiments on a number of real world problems with A irreg- 
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ular sparse have demonstrated that our approach works very well and has drastic effects on 
the overall performance of SPAI and PSAI(toZ) preconditioning, compared with SPAI and 
PSAI(foZ) applied to the original (JTJ) directly. 

Our approach may be applicable to factorized sparse approximate inverse preconditioning 
procedures [3]. Although reorderings of A may help such kind of procedures to reduce fill- 
ins, enhance robustness and improve numerical stability when constructing factorized SAI 
preconditioners, they are not always so and in fact even may make things worse for A irregular 
sparse [6]. We believe that our approach may be combined with reorderings to construct 
effective factorized SAI preconditioners for regular sparse linear systems resulting from the 
irregular sparse one. 

We have demonstrated that our approach is more practical and for irregular sparse linear 
systems. Even for A very large with a large number of irregular sparse columns, our approach 
work effectively as the cost of constructing preconditioners overwhelms the cost of solving 
linear systems themselves provided that preconditioners are effective. One might notice that 
the right-hand side of (|2ip is inversely proportional to y/s and smaller than the required 
accuracy e. Nevertheless, the higher accuracy requirements on the s linear systems do not 
cause any difficulty since they decrease quite slowly with s and are no more than two or three 
orders smaller than e even for s ranging from 10 4 ~ 10 6 (note such s means that A is very 
large or huge). 
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